Standard error and bootstrapped confidence intervals

Author: Leonardo Espin

Date: 5/16/2019

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
In [2]:
#create a population from a normal distribution
mu, sigma = 3.15, 1.37
population = np.random.normal(mu, sigma, 10000)
In [3]:
#get a single sample which will be bootstrapped
thesample=population[np.random.randint(low = 0, high = population.size -1, size = 100)]
sns.distplot(thesample,kde=False)
print('sample mean = {}'.format(thesample.mean()))
print('sample standard deviation = {}'.format(thesample.std()))
sample mean = 3.17492425931606
sample standard deviation = 1.4521702308326077
In [4]:
#compute the standard error for the population mean from multiple samples
means=[]
n=population.size -1
#getting the mean from multiple population samples
for i in range(100):
    tmpsample=population[np.random.randint(low = 0, high =n , size = 100)]
    means.append(tmpsample.mean())
    
#plotting the means of each sample, mean of means and the standard error
mm=np.mean(means)
se=np.std(means)
plt.eventplot(means)
plt.eventplot([mm-se,mm,mm+se],colors='r',linewidths=5)
print('mean of means = {}'.format(mm))
print('Standard error of the mean = {}'.format(se))
mean of means = 3.1462583889438225
Standard error of the mean = 0.14322253766270088

Assume that each sample is obtained from some experimental measurement that is expensive to perform. In that case it makes more sense to use bootstrapping for computing the standard error, or confidence intervals:

In [6]:
#bootstrapping a single sample
bootMeans=[]
for i in range(100000):
    tmpsample=thesample[np.random.randint(low = 0, high =99 , size = 100)]
    bootMeans.append(tmpsample.mean())
In [7]:
#plotting the bootstrapped means and standard error
mm=np.mean(bootMeans)
se=np.std(bootMeans)
plt.eventplot(bootMeans[0:len(bootMeans)-1:100])#down-sampled
plt.eventplot([mm-se,mm,mm+se],colors='r',linewidths=5)
print('bootstrapped mean = {}'.format(mm))
print('Bootstrapped standard error of the mean = {}'.format(se))
bootstrapped mean = 3.164895756598803
Bootstrapped standard error of the mean = 0.14565014411336885

The 95% confidence interval for the population mean constructed from the bootstrapped samples is:

In [20]:
bootMeans.sort()
m=len(bootMeans)
low=bootMeans[int(0.025*m)]
high=bootMeans[int((1-0.025)*m)]
print('95% confidence interval for the population mean = [{:0.2f}, {:0.2f}]'.format(low,high))
95% confidence interval for the population mean = [2.88, 3.45]

Since two standard deviations on each side of the mean cover around 95% of the data, the bootstrapped standard error, should be about 1/4 of the width of the bootstrapped confidence interval. Indeed:

In [23]:
print('Standard error X 4 is {:0.2f}'.format(se*4))
print('Length of the confidence interval is {:0.2f}'.format(high-low))
Standard error X 4 is 0.58
Length of the confidence interval is 0.57

Notice that we could build a similar interval containing the 95% of sample means:

In [24]:
means.sort()
m=len(means)
low=means[int(0.025*m)]
high=means[int((1-0.025)*m)]
print('interval with 95% of sample means = [{:0.2f}, {:0.2f}]'.format(low,high))
interval with 95% of sample means = [2.90, 3.40]

This interval certainly contains the population mean. However, notice that it is contained in the bootstrapped confidence interval above, and we don't know what is the probability attached to this interval of it containing the mean.